Final Report

Final Project
Data Science 2 with R (STAT 301-2)

Author

Jeremy Ferguson

Published

March 13, 2024

Introduction

This project aims to create a predictive model that can estimate the yearly salaries of professional basketball players from the National Basketball Association (NBA). This model uses season statistics from players as its predictors. Other NBA-related factors, such as the location of the player’s team and how long the player has been in the NBA, are also used for the analysis.

This research question is a regression problem: we are trying to predict salary, a continuous outcome variable. Since we are comparing salaries across several decades, we need to express all salaries in constant dollars. Therefore, the target variable is a player's yearly salary adjusted to 2023 prices using the Consumer Price Index for All Urban Consumers from the US Bureau of Labor Statistics.
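As a sketch of this adjustment (assuming hypothetical column names `salary` and `cpi`, and an illustrative 2023 CPI-U value), the deflation is a simple ratio:

```r
# Hypothetical sketch of the CPI adjustment; `nba`, `salary`, and `cpi` are
# assumed names, and the 2023 CPI-U value below is illustrative only.
library(dplyr)

cpi_2023 <- 304.7  # placeholder; the actual series comes from the BLS

nba <- nba |>
  mutate(adjusted_salary = salary * (cpi_2023 / cpi))  # cpi = CPI-U for that season
```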

I see this research most benefiting NBA players. With this model, players would have a better understanding of what they should expect from contract offers based on their previous performance. The model also helps indicate what factors outside of the players’ control, such as what conference a team plays in, contribute towards their salaries.

Besides its player-specific benefits, this model allows me to explore the NBA computationally. I am a great fan of the NBA (especially my hometown Chicago Bulls), so creating this model has been fun and has taught me much about historical season statistics. Observing which statistics are highly valued when a contract is devised also gives me a deeper understanding of how organizations form championship-contending teams.

Data Overview

Before proceeding to methodologies, it is important to check the quality of our data. Doing so will clearly lay out concerns we have about the raw data, driving how recipes, model comparison, and tuning parameters are selected and modified. Below, I use the entire dataset to explore missing values and our target variable.

Missingness

Table 1 presents the number and percent of missingness for each variable. We can see that all missing observations come from the _percent variables. These are variables that measure the shooting percentages of players.

Table 1: Missingness in dataset

We can interpret this missingness as players who never attempted a 2-point shot, a 3-point shot, or a free throw during a season. The 3-point missingness does not bother me, since historically players have contributed greatly to a team while never taking a 3-point shot in a season. For example, during his MVP season in 1999-2000, center Shaquille O’Neal did not attempt a single 3-point shot in the regular season or during the Los Angeles Lakers’ championship playoff run. These NAs can be replaced with 0s. We can identify players who contributed greatly to their team without shooting a 3 by filtering for observations that have missingness only in the 3-point percentage category.

Those without a 2-point attempt or free-throw attempt concern me, as these players also have 0 or near 0 statistics for every other numerical predictor. Table 2 gives a sample of the lack of data for observations in this group.

Table 2: Sample of those missing field-goal percentages in data

It does not make sense to include these observations in recipes since their values are not a strong representation of how statistics change salaries. Also considering that these observations make up a small part of the dataset, the best choice is to drop them from the dataset.
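The two missingness decisions above might be sketched as follows, assuming hypothetical column names such as `x3p_percent`, `x2p_percent`, and `ft_percent`:

```r
# Sketch of the missingness handling; `nba` and the column names are assumed.
library(dplyr)
library(tidyr)

nba_clean <- nba |>
  # drop players with no 2-point or free-throw attempts (near-empty stat lines)
  filter(!is.na(x2p_percent), !is.na(ft_percent)) |>
  # a player who never attempted a 3 gets a 0% three-point percentage
  mutate(x3p_percent = replace_na(x3p_percent, 0))
```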

Target Variable Exploration

We start the univariate analysis of our target variable by looking at a histogram of adjusted salaries. Figure 1 shows this distribution. From the histogram, we can see that the majority of NBA players earn less than five million dollars in a year. The distribution does not seem to have any additional local peaks.

Figure 1

However, our distribution is right-skewed. Ideally, we want our outcome variable to have a normal distribution so we can apply statistical properties and techniques that require a normality assumption. Reducing skewness will also make values easier to predict in our model. A common transformation for right-skewed data is a log transformation, which helps reduce the skewness and rein in extreme values. The density plot and boxplot of adjusted salaries in Figure 2 give another perspective on the right-skewness of the data.

Figure 2

After using the log transformation, our data looks a little left-skewed. Nevertheless, we have reduced the skewness, allowing for easier analysis. Figure 3 shows this left-skewness.

Figure 3

We can reduce the skewness of our data even more by considering a less common transformation. I found that transforming the outcome variable by the 7th root essentially removes the skewness. This transformation is visualized in Figure 4.

Figure 4

While this transformation gives us a distribution close to normal, we also need to consider the interpretability of using it; explaining the findings of our model to an NBA player or agent in units of 7th-root dollars would be difficult. However, interpretability should not be a large concern, since we can transform the results of our model back to conventional units. Therefore, with the desire for normality in mind, I decided to transform adjusted salaries by the 7th root.
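The transformation and its back-transformation are both one-liners; a minimal sketch (column names assumed):

```r
# 7th-root transformation of the outcome (assumed column names).
nba <- dplyr::mutate(nba, salary_7th = adjusted_salary^(1/7))

# After modeling, predictions return to dollars by raising to the 7th power:
# predicted_salary <- predicted_salary_7th^7
```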

Methods

Again, this prediction model is a regression problem. I use RMSE as the assessment metric for this project. In this regression analysis, I am most concerned with the accuracy of my models’ estimates; I want to predict NBA salaries as closely as possible, as this is the main value a player or agent cares about. RMSE also penalizes large errors heavily, so the models are discouraged from making predictions that are far off for outlying observations.

Data Splitting

Before splitting, our data set has 13985 observations. This data set is on the smaller side, so the split should lean towards having many training observations. Of course, we do not want our model to overfit the training data. I chose a .75/.25 training-testing split, as I believed this to be a good middle-ground proportion for the data. Our training set has 10440 observations, and our testing set has 3483 observations.
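With rsample, this split might look like the following (the seed is arbitrary, and stratifying on the outcome is an addition not stated in the text):

```r
# A .75/.25 split sketched with rsample; stratification on the transformed
# outcome is an assumption, not something stated in the report.
library(rsample)

set.seed(301)
nba_split <- initial_split(nba, prop = 0.75, strata = salary_7th)
nba_train <- training(nba_split)
nba_test  <- testing(nba_split)
```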

Resampling Technique

I chose v-fold cross-validation as my resampling method. The method has 10 folds repeated 5 times. I chose 10 folds to strike a balance between bias and variance. Given that our data set is on the smaller side, it is not necessary to explore a larger number of folds, as a higher value would likely produce estimates with very high variance. Repeating 5 times helps diminish the noisiness of our data. This repetition is particularly useful for the accuracy of the standard errors of our performance metric.

When folding, our model fits on 9396 observations and is assessed on about 1044 observations. I find these numbers to be reasonable; we have a nice number of observations for the model itself and ample observations for assessment. Through data collection, I noticed that each NBA season has about 500 players. So, we can think of this method as assessing with two seasons’ worth of players.
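In rsample syntax, this resampling scheme is a one-liner:

```r
# 10 folds repeated 5 times, yielding 50 resamples per model-recipe pair
# (the n = 50 seen in the results tables).
nba_folds <- rsample::vfold_cv(nba_train, v = 10, repeats = 5)
```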

Model Types

My project will use the null model, linear regression model, elastic net model, random forest model, gradient boosted tree model, neural network regression model, and multivariate adaptive regression splines model. A key reason for my choices was to continuously increase flexibility while examining the change in the interpretability of my results. Also, many of the predictor variables have high correlations with each other. A few examples of these correlations are seen in Table 3:

Table 3: Examples of high correlation
term mp fg fga
mp NA 0.8913389 0.8951823
fg 0.8913389 NA 0.9810554
fga 0.8951823 0.9810554 NA

High correlation can lead to unstable estimates and overfitting in my models, so I address this issue through my choices of methodology. An expanded analysis of this observation can be found in the EDA portion of the appendix.

Null and Linear

The null and linear regression models are used mainly as a baseline. These models give a simple, trivial result to my data. We can assume that there are complexities in our data that need to be accounted for to produce the best performance metric. However, having baselines gives us a good indication of whether or not our other models are being hurt by complexity.

Keeping in mind the simplicity of the models, the null and linear regression models do not have hyperparameters to tune.

Elastic Net

The elastic net model addresses some of the high-correlation and multicollinearity concerns I have with the data. Since the elastic net combines the lasso and ridge penalties, shrinking (and potentially dropping) the coefficients of correlated predictors, this model is well suited for my analysis.

I will tune two hyperparameters for this model. The mixture hyperparameter lets us test different proportional combinations of the lasso and ridge penalties. The penalty hyperparameter varies the overall strength of the regularization the model applies to its coefficients.
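A parsnip specification consistent with this description (the glmnet engine is named later, in the tuning section):

```r
# Elastic net specification with both hyperparameters flagged for tuning.
library(parsnip)
library(tune)

en_spec <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet") |>
  set_mode("regression")
```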

Random Forest

The random forest model is a nice middle ground between interpretability and flexibility. The model handles overfitting concerns and outliers well. In addition, this model can handle both linear and nonlinear relationships. Exploring the training set, I saw a few predictors that may have a nonlinear relationship with the outcome variable. For example, Figure 5 shows that the blocks predictor does not seem to have a steady, linear trend.

Figure 5

If these relationships are truly nonlinear, random forests will handle them with little tuning. Concerning tuning, I vary three hyperparameters. First, I vary the number of predictors sampled for each decision tree. It is important to keep in mind that higher values of this hyperparameter make the individual trees more correlated, which could cause overfitting. Next, I tune the number of decision trees used in the prediction model. Allowing a larger number of trees should lead to more robust predictions at the cost of longer computation times. Finally, I vary the minimum number of observations a node must contain before it can be split. It is best to keep this value on the lower side, as lower values allow deeper trees with the complexity required to capture the patterns of our data.
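The corresponding parsnip specification (the ranger engine is named later, in the tuning section):

```r
# Random forest specification with the three tuned hyperparameters.
library(parsnip)
library(tune)

rf_spec <- rand_forest(
  mtry  = tune(),  # predictors sampled at each split
  trees = tune(),  # number of trees in the forest
  min_n = tune()   # minimum observations in a node before splitting
) |>
  set_engine("ranger") |>
  set_mode("regression")
```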

Gradient Boosted Tree

Boosted trees allow for more complexity than random forests, potentially producing more accurate performance metrics. However, the hyperparameters are quite sensitive; if the parameters are not tuned correctly, the results will not be worthwhile. This model tunes the same hyperparameters as the random forest model. Additionally, boosted trees tune the rate at which the model learns from previous iterations of itself. Using a lower value for this hyperparameter decreases the chance of overfitting since the weight given to each tree is smaller. However, a lower value means a higher number of trees is needed to achieve robust results.
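A matching parsnip sketch; the report does not name the boosted-tree engine, so xgboost here is an assumption:

```r
# Boosted tree specification; the xgboost engine is an assumption.
library(parsnip)
library(tune)

bt_spec <- boost_tree(
  mtry       = tune(),
  trees      = tune(),
  min_n      = tune(),
  learn_rate = tune()  # how strongly each tree corrects the previous ones
) |>
  set_engine("xgboost") |>
  set_mode("regression")
```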

MARS

I am choosing to try the multivariate adaptive regression splines (MARS) model mainly for its ability to deal with nonlinearity while still having high interpretability compared to other models with a heavy nonlinearity focus. However, if our nonlinear data has many sharp changes (say one of our predictor-outcome relationships exhibits multiple peaks and troughs), the piecewise linear segments that MARS uses to capture its nonlinear relationships may not be accurate.

For this model, I am tuning two hyperparameters. The first parameter varies the number of terms that are used in the final prediction model. Increasing this value can allow us to capture more complexity. However, increasing this number too much will result in overfitting. The second hyperparameter varies the degree of the interaction term in the model. Similar to the first parameter, increasing the allowed degrees helps with capturing complexities in the model at the cost of potentially overfitting.
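As a parsnip sketch (the earth engine is named later, in the tuning section):

```r
# MARS specification with both hyperparameters flagged for tuning.
library(parsnip)
library(tune)

mars_spec <- mars(
  num_terms   = tune(),  # terms retained in the final model
  prod_degree = tune()   # maximum degree of interaction
) |>
  set_engine("earth") |>
  set_mode("regression")
```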

Neural Network

The neural network regression model will provide more nonlinear flexibility than MARS if the intricacy of our data becomes a problem. While the model can create estimations with high accuracy, this process is very time-consuming, with multiple hyperparameters needing to be tuned. Since our data set is on the smaller side, it may be the case that neural networks actually overfit our data.

Three hyperparameters are tuned for this model. First, we vary the weight decay of the model, with higher decay values preventing the model from learning overly complex patterns that are specific to the training set. Next, we adjust the number of neurons used in the hidden layer of the model, with more neurons leading to more complexity and more potential for overfitting. Finally, we modify the number of training iterations (epochs) the model performs over the entire training set. Following the same pattern, larger values mean more complexity but more potential for overfitting.
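As a parsnip sketch (the nnet engine is named later, in the tuning section):

```r
# Single-layer neural network specification with the three tuned hyperparameters.
library(parsnip)
library(tune)

nn_spec <- mlp(
  hidden_units = tune(),  # neurons in the hidden layer
  penalty      = tune(),  # weight decay
  epochs       = tune()   # passes over the training set
) |>
  set_engine("nnet") |>
  set_mode("regression")
```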

Recipes

For this project, I use four separate recipes to conclude the best-performing model. I want to examine how well various predictors and techniques do for estimating salaries. Here, I give explanations for my recipe decisions.

For all recipes, predictors with zero variance were removed. A predictor with zero variance has no distinguishing factor between observations and therefore contributes nothing to the recipe. To place all predictors on the same scale, we normalize them. We also drop predictors that have high correlations with other predictors; again, high correlations can lead to multicollinearity, resulting in unstable estimates and overfitting. In general, correlations over .7 are considered high, so we remove any predictors with correlations above this threshold.
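These shared steps might be sketched as a recipe; the outcome and predictor names are placeholders, since the actual predictor set varies by recipe:

```r
# Shared preprocessing steps; `salary_7th` and the predictor set are placeholders.
library(recipes)

base_rec <- recipe(salary_7th ~ ., data = nba_train) |>
  step_zv(all_predictors()) |>                          # remove zero-variance predictors
  step_normalize(all_numeric_predictors()) |>           # center and scale
  step_corr(all_numeric_predictors(), threshold = 0.7)  # drop highly correlated predictors
```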

General

The first recipe focuses on general player statistics, such as points, assists, and rebounds. These are the predictors the average person thinks of when they imagine basketball statistics. This recipe is simplistic and minimizes the complexity of the modeling process; only a handful of interaction terms are used and no nonlinear trends are addressed. I imagine we will get fairly reasonable predictions with this recipe, but more complexity will likely improve the results. After removing high correlations and zero-variance predictors, we end up with 21 predictors in this recipe. Table 4 shows the predictors used in this recipe:

Table 4: Predictors for general recipe

Interaction

The second recipe will place a heavy focus on interaction terms. There are several predictor variables where we may expect differential effects to occur. For example, Figure 6 visualizes that there is a positive correlation between games started and salaries. Moreover, players who have been in the NBA for 10 or more years have higher salaries at every level along the x-axis. So, we should create a predictor that is the interaction between games played and whether a player has been in the NBA for 10 or more years.

Figure 6

Another key difference between this recipe and the general recipe is the use of percentage variables. For example, instead of using the variable for field goals, I use the variable for field-goal percentage. Here, I am trying to see whether changing the form of the variable changes the outcome of the model. I do not expect this change to have significant effects, as these variables can be derived from combinations of other variables. All in all, the interaction recipe has 33 predictors, as seen in Table 5.

Table 5: Predictors for interaction recipe

Nonlinear

The third recipe decreases the number of interaction terms used and adds nonlinearity complexity. While few, there are predictors that seem to exhibit some nonlinear trends in their distribution, as shown in Figure 5. While the nonlinearity is not extreme, I believe the trend is significant enough to merit a recipe that focuses specifically on this aspect of the data.

Like the interaction recipe, the nonlinear recipe uses statistical percentages in place of the general predictors when appropriate. The nonlinear recipe has 40 predictors, as seen in Table 6.

Table 6: Predictors for nonlinear recipe

Out-of-Control

The last recipe places focus on predictors that are not completely in the player’s control, such as their team’s market size, their position, and whether or not their team makes the playoffs. Much of my reasoning for creating this recipe comes less from data exploration and more from personal theories. Although they are not directly connected to season statistics, these predictors could easily affect the accuracy of our models. It could be the case that a player who signs a contract for the New York Knicks should expect a much higher salary than if they signed for the San Antonio Spurs. This effect may simply be due to the differences in cost-of-living between various market sizes. It may also be the case that organizations in larger markets are wealthy and therefore can sign players at a market-premium.

This recipe maximizes predictors not necessarily in the control of the players and minimizes variables that are, like points, assists, etc. Neither the percentage nor non-percentage statistics are used to emphasize this minimization. The out-of-control recipe has 29 predictors, as seen in Table 7.

Table 7: Predictors for out-of-control recipe

Model Building & Selection

We will go through the tuning results of each recipe. It is important to note that, for consistency of comparison, the tuning grids were held constant across recipes.

Tuning Specificities

Before continuing to the results, we should review the values used for each hyperparameter. All parameters use 5 levels. Of course, there is no need to further examine the linear and null models, as these models do not have any tuning parameters associated with them.

The elastic net model uses a mixture of lasso and ridge regressions between 0 and 1, which covers the model leaning fully towards lasso and fully towards ridge. The penalty is placed between .316 and 3.16. Here, I wanted to observe differences in the models with high and low penalties. I use the glmnet engine.
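Note that dials specifies `penalty()` on the log10 scale, so a range of -0.5 to 0.5 produces the .316-to-3.16 penalties seen in the results tables. A sketch of this grid:

```r
# Regular 5-level grid for the elastic net; penalty() takes log10 exponents,
# so c(-0.5, 0.5) yields penalties from about 0.316 to 3.16.
library(dials)

en_grid <- grid_regular(
  penalty(range = c(-0.5, 0.5)),
  mixture(range = c(0, 1)),
  levels = 5
)
```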

The random forest model uses 1 to 15 predictors for each decision tree, a minimum node size between 2 and 4, and 100 to 1000 trees. Of course, the random forest model should not use all the predictors when creating these splits; 1 to 15 predictors meets that requirement for each of my recipes. I chose to stay conservative with minimum node size, as we want lower values to handle general complexity. The wide range of trees simply captures how performance changes as we increase the complexity (the number of trees) of the model. I use the ranger engine.

The boosted trees model uses 1 to 15 predictors for each tree, a minimum node size between 1 and 10, and 100 to 1000 trees. The same reasoning used in the random forest model applies here. I intended for the boosted trees model to also tune the learn rate between .1 and .95. However, I carelessly did not account for the log-transformation attached to this hyperparameter when tuning. So the results of the boosted tree model are unfortunately mismeasured and not useful to the analysis. At the time of writing this report, there is not enough time to fully retune the boosted trees model on four different recipes, but appropriate retuning will be done in the future.
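The mismeasurement is consistent with `dials::learn_rate()` being tuned on the log10 scale: supplying .1 to .95 as raw range values yields rates of 10^.1 ≈ 1.26 through 10^.95 ≈ 8.9, matching the inflated rate of 1.258925 seen in Table 16. A corrected grid, as a sketch:

```r
# learn_rate() expects log10 exponents, so the intended .1-to-.95 range must
# be passed through log10() first.
library(dials)

lr_grid <- grid_regular(
  learn_rate(range = c(log10(0.1), log10(0.95))),  # i.e., roughly c(-1, -0.02)
  levels = 5
)
```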

The neural networks model uses hidden units between 0 and 5, a penalty between 0 and 1, and a number of training iterations between 100 and 750. Since I observed only a small amount of nonlinearity in the data, I did not want to tune very high values in this model. As explained earlier, increasing the hidden units and iterations makes the model more complex and more prone to overfitting, while a larger penalty counteracts that complexity. It will be interesting to observe, from a nonlinearity standpoint, how complex this model has to become before it hurts the results. I use the nnet engine.

The MARS model uses a number of retained terms between 1 and 50, and an interaction degree between 1 and 5. Terms between 1 and 50 cover all potential sizes for each recipe used. Again, I wanted to stay conservative with nonlinear complexity in these models since there was not an extreme abundance of it in my exploration. I use the earth engine.

Individual Model Results

This section compares the performance metrics of each of the four recipes, holding one model constant. Since the null and linear models do not use hyperparameters, these models are saved for the final comparison analysis.

Elastic Net

Table 8 shows the tuning parameters with the top 5 best model performances for elastic net using the general recipe. We can see that the model performs best when the mixture is 0 or close to 0. This result implies that the model strongly prefers the ridge regression concerning regularization. A smaller penalty is also preferred, leading our model closer to a linear regression model (which would have no penalty). In the future, it may be interesting to examine the RMSE when we allow for no penalty. Considering the trend of the data, we should expect a better-performing model.

Table 8: Best performances of elastic net (using general recipe)
penalty mixture .metric mean n std_err
0.3162278 0.00 rmse 1.236227 50 0.0033517
0.5623413 0.00 rmse 1.239372 50 0.0033285
1.0000000 0.00 rmse 1.245986 50 0.0032946
0.3162278 0.25 rmse 1.247118 50 0.0032604
1.7782794 0.00 rmse 1.259097 50 0.0032560

Table 9, Table 10, and Table 11 show the elastic net model using the interaction, nonlinear, and out-of-control recipes, respectively. We can see that each of these recipes uses the same combination of optimal hyperparameters as Table 8. We can conclude that the hyperparameters do not have varying effects on the elastic net model when changing recipes. However, each of these recipes performs better than the general recipe; the nonlinear recipe gives a mean of 1.20 and the other two recipes give a mean of 1.11 (though the standard error of the interaction recipe is slightly smaller).

Table 9: Best performances of elastic net (using interaction recipe)
penalty mixture .metric mean n std_err
0.3162278 0.00 rmse 1.106416 50 0.0030110
0.5623413 0.00 rmse 1.116479 50 0.0030272
1.0000000 0.00 rmse 1.134135 50 0.0030533
0.3162278 0.25 rmse 1.139569 50 0.0029839
1.7782794 0.00 rmse 1.162756 50 0.0030894
Table 10: Best performances of elastic net (using nonlinear recipe)
penalty mixture .metric mean n std_err
0.3162278 0.00 rmse 1.198944 50 0.0030584
0.5623413 0.00 rmse 1.206345 50 0.0030633
1.0000000 0.00 rmse 1.217816 50 0.0030666
0.3162278 0.25 rmse 1.233707 50 0.0031586
1.7782794 0.00 rmse 1.235106 50 0.0030694
Table 11: Best performances of elastic net (using out-of-control recipe)
penalty mixture .metric mean n std_err
0.3162278 0.00 rmse 1.111376 50 0.0030613
0.5623413 0.00 rmse 1.130025 50 0.0030808
0.3162278 0.25 rmse 1.141342 50 0.0028242
1.0000000 0.00 rmse 1.157503 50 0.0031255
0.3162278 0.50 rmse 1.179132 50 0.0028970

Random Forest

Table 12 shows the tuning parameters with the top 5 model performances for random forests using the general recipe. We can see that, for this model, the number of predictors dictates the mean of our prediction metric; the number of trees and the minimum node size seem to affect only the standard error of our models. However, these standard errors are very close to each other.

Table 12: Best performances of random forest (using general recipe)
mtry trees min_n .metric mean n std_err
4 775 3 rmse 1.148740 50 0.0028316
4 1000 3 rmse 1.148826 50 0.0028336
4 1000 2 rmse 1.149000 50 0.0028392
4 550 4 rmse 1.149005 50 0.0028257
4 550 2 rmse 1.149026 50 0.0028662

Table 13 shows the random forest model using the interaction recipe. The number of predictors in the best model increases, which likely reflects the larger number of predictors in this recipe in general. The number of trees maxes out at 1000, and the minimum node size remains at 3. The mean of this recipe is much smaller than the general recipe’s. For future tuning, it would be interesting to see if increasing the number of trees helps or hurts the current model performance.

Table 13: Best performances of random forest (using interaction recipe)
mtry trees min_n .metric mean n std_err
8 1000 3 rmse 1.022242 50 0.0025926
8 1000 2 rmse 1.022255 50 0.0026891
8 550 4 rmse 1.022477 50 0.0026012
8 775 4 rmse 1.022479 50 0.0025933
11 1000 2 rmse 1.022483 50 0.0025902

Table 14 shows the random forest model using the nonlinear recipe. In terms of hyperparameters, the only difference between this recipe and the general recipe is that the nonlinear recipe uses the maximum number of allotted predictors (15). Even with this change, the mean of the nonlinear recipe’s performance metric is essentially equal to the general recipe’s, and the standard error is slightly worse. These observations imply that the nonlinear model does no better than a simplistic model.

Table 14: Best performances of random forest (using nonlinear recipe)
mtry trees min_n .metric mean n std_err
15 775 3 rmse 1.147142 50 0.0031113
15 1000 4 rmse 1.147338 50 0.0030916
15 1000 3 rmse 1.147402 50 0.0030905
11 1000 4 rmse 1.147543 50 0.0031458
15 775 4 rmse 1.147559 50 0.0031176

Table 15 shows the random forest model using the out-of-control recipe. This best-performing model uses 4 predictors like the general recipe and 1000 trees like the interaction recipe. Unlike the other recipes, this model uses a minimum node size of 4. The mean is not as good as the interaction recipe’s but is still a nice improvement over the other two recipes.

Table 15: Best performances of random forest (using out-of-control recipe)
mtry trees min_n .metric mean n std_err
4 1000 4 rmse 1.058490 50 0.0029350
4 1000 2 rmse 1.058576 50 0.0030127
4 775 3 rmse 1.058591 50 0.0029859
4 775 2 rmse 1.058663 50 0.0029830
4 550 3 rmse 1.058735 50 0.0029857

Boosted Tree

As mentioned in the tuning specification section, the learn rate of the boosted tree model was inaccurately tuned. This error greatly hurts the performance of all the recipes on this model.

As one example, Table 16 shows the tuning parameters with the top 5 model performances for boosted trees using the general recipe. We can see that the learn rate is exceptionally high, the exact opposite of what we want if we are trying to minimize overfitting. Since the learn rate is so high, the boosted trees model needs fewer trees and utilizes fewer predictors compared to the random forest model. Again, these findings are not an accurate representation of the best performance under my intended hyperparameters. While not ideal, I believe that we have enough model-recipe combinations to drop boosted trees and still maintain robustness in our prediction model.

Table 16: Best performances of boosted trees (using general recipe)
mtry trees min_n learn_rate .metric mean n std_err
1 100 10 1.258925 rmse 1.210426 50 0.0038173
1 100 7 1.258925 rmse 1.215288 50 0.0037886
1 100 3 1.258925 rmse 1.216892 50 0.0032704
1 100 5 1.258925 rmse 1.218620 50 0.0035626
1 100 1 1.258925 rmse 1.220836 50 0.0037519

Neural Network

Table 17 shows the tuning parameters with the top 5 model performances for neural networks using the general recipe. This model prefers the maximum hyperparameters I specified in the tuning process: 5 hidden units, a penalty of 1, and 750 training iterations on the entire training set. Hidden units and penalties seem to strongly dictate the mean of RMSE while the training iterations vary the standard error. Given the model prefers its maximized values, we should consider raising each of these values in future tuning. I expect the performance metric to be better in this setting.

Table 17: Best performances of neural network (using general recipe)
hidden_units penalty epochs .metric mean n std_err
5 1.0000000 750 rmse 1.142152 50 0.0029015
5 1.0000000 425 rmse 1.142714 50 0.0030667
5 1.0000000 262 rmse 1.143042 50 0.0029472
5 1.0000000 587 rmse 1.143234 50 0.0029866
5 0.0031623 750 rmse 1.147479 50 0.0027754

Table 18 shows the neural network model using the interaction recipe. For the best performing model, hidden units and penalty hyperparameters remain at 5 and 1, respectively. The number of training iterations decreases to 587. Like previous models, the mean of RMSE is lower when using the interaction recipe compared to the general recipe.

Table 18: Best performances of neural network (using interaction recipe)
hidden_units penalty epochs .metric mean n std_err
5.00 1 587 rmse 1.023219 50 0.0031808
5.00 1 750 rmse 1.025804 50 0.0028749
5.00 1 425 rmse 1.025949 50 0.0030122
5.00 1 262 rmse 1.028753 50 0.0029015
3.75 1 750 rmse 1.029667 50 0.0029157

Table 19 shows the neural network model using the nonlinear recipe. Following the same trend as found in the random forest model, the best-performing model for the neural network using the nonlinear recipe is essentially the same as with the general recipe. The means are nearly identical, and the standard error for the general recipe is slightly lower. It is clear that the nonlinear recipe is not a good candidate for the best-performing model.

Table 19: Best performances of neural network (using nonlinear recipe)
hidden_units penalty epochs .metric mean n std_err
5.00 1 750 rmse 1.136992 50 0.0031588
5.00 1 587 rmse 1.137704 50 0.0031409
5.00 1 425 rmse 1.138038 50 0.0030774
3.75 1 425 rmse 1.138801 50 0.0028901
3.75 1 750 rmse 1.139173 50 0.0028453

Table 20 shows the neural network model using the out-of-control recipe. Like the general and nonlinear recipes, the hyperparameters for the best-performing model all max out my tuning parameters. Similar to previous models, the mean of this recipe is the 2nd-best performing (the interaction recipe performs better).

Table 20: Best performances of neural network (using out-of-control recipe)
mtry trees min_n .metric mean n std_err
4 1000 4 rmse 1.058490 50 0.0029350
4 1000 2 rmse 1.058576 50 0.0030127
4 775 3 rmse 1.058591 50 0.0029859
4 775 2 rmse 1.058663 50 0.0029830
4 550 3 rmse 1.058735 50 0.0029857

MARS

Table 21 shows the tuning parameters with the top 5 model performances for MARS using the general recipe. This model performs best with 37 terms and an interaction degree of 2. These observations imply that it prefers higher numbers of terms with lower interaction degrees. Therefore, heavy nonlinear complexity is unnecessary with this recipe, which is what we expected.

Table 21: Best performances of MARS (using general recipe)
num_terms prod_degree .metric mean n std_err
37 2 rmse 1.140329 50 0.0029700
50 2 rmse 1.140329 50 0.0029700
25 2 rmse 1.140796 50 0.0029810
37 4 rmse 1.141257 50 0.0031326
50 4 rmse 1.141257 50 0.0031326

Table 22 shows the MARS model using the interaction recipe. We can see that the best model here maximizes the allotted values of our hyperparameters. The mean and standard errors are better in this model as well. It seems that the introduction of these new interaction terms creates the need for higher interaction degrees. Future tuning would increase these values to see if performance improves further.

Table 22: Best performances of MARS (using interaction recipe)
num_terms prod_degree .metric mean n std_err
50 5 rmse 1.024031 50 0.0028655
50 4 rmse 1.024463 50 0.0028797
37 5 rmse 1.024540 50 0.0028408
37 4 rmse 1.024954 50 0.0028532
50 3 rmse 1.025352 50 0.0027956

Table 23 shows the MARS model using the nonlinear recipe. We observe that the best means for this recipe and the general recipe are essentially the same. Furthermore, the standard errors are nearly identical. This observation implies that the additional complexity of the nonlinear recipe is unnecessary when using the MARS model.

Table 23: Best performances of MARS (using nonlinear recipe)
num_terms prod_degree .metric mean n std_err
37 2 rmse 1.136443 50 0.0026910
50 2 rmse 1.136619 50 0.0026993
37 3 rmse 1.137574 50 0.0030162
37 4 rmse 1.137715 50 0.0030430
37 5 rmse 1.137715 50 0.0030430

Table 24 shows the MARS model using the out-of-control recipe. The best model follows the same pattern with 37 terms. The product degree settles at 4, likely due to the increase in interaction terms in this recipe compared to the general and nonlinear recipes. The mean for this recipe is also second only to the interaction recipe.

Table 24: Best performances of MARS (using out-of-control recipe)
num_terms prod_degree .metric mean n std_err
37 4 rmse 1.051653 50 0.0031972
50 4 rmse 1.051653 50 0.0031972
37 5 rmse 1.051653 50 0.0031972
50 5 rmse 1.051653 50 0.0031972
37 3 rmse 1.052096 50 0.0032181

Grouping Together

The analysis above clearly shows that the nonlinear recipe will not be the best recipe for our model: for every model type, the simpler general recipe works as well as or better than the nonlinear recipe. So we can leave those results out of this analysis. Table 25 shows the best RMSE for each model in the general, interaction, and out-of-control recipes. Given the methodology of our project, we need to select the best-performing model-recipe combination. When dealing with RMSE, the smaller the value, the better.

Table 25: Comparison of best performing models across recipes
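A sketch of how a comparison table like this could be assembled: collect the best RMSE from each tuned object and bind the rows together. Here `tuned_results` is a hypothetical named list holding the `tune_grid()` output for each model-recipe combination:

```r
library(tidymodels)
library(purrr)

# Pull the single best RMSE row from each tuning result, labeled by name
best_rmse <- imap_dfr(
  tuned_results,
  ~ show_best(.x, metric = "rmse", n = 1) |>
      mutate(model_recipe = .y)
)

# Smallest mean RMSE first
best_rmse |>
  arrange(mean) |>
  select(model_recipe, mean, std_err)
```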

Though it may already be apparent from the table that the interaction recipe tends to perform the best, we can visualize the mean and standard errors of each group. Figure 7 visualizes the performance metric of each of our models using the three targeted recipes.

Figure 7

The figure helps us see that many of the models from the interaction recipe produced smaller RMSE values in comparison to other results. Looking at the performance metrics of the interaction recipe, we can see that neural network, MARS, and random forest all have essentially the same minimizing mean with slightly different standard errors. It is important to note that all models using this recipe maxed out the range of hyperparameter values I used, implying that each of these metrics could have been smaller had we expanded the tuning ranges. To me, choosing the winning model comes down to interpretability, simplicity, and time cost. The random forest model is easier to interpret compared to the other two models, as it mirrors decision-making. Random forest is also simpler and more cost-effective than the nonlinear nature of neural networks and MARS. Therefore, the winning model in my project is the random forest model using the interaction recipe. This finding does not surprise me, as most of the complexity in this model came from interactions between predictors. Since random forests can capture some nonlinearity, the model easily handled the rather minimal amount of nonlinearity in the data.
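Once the winner is chosen, the standard tidymodels pattern is to finalize the workflow with the best hyperparameters and fit it to the full training set. This is a sketch under assumed object names (`rf_tuned`, `rf_wflow`, `nba_train`):

```r
library(tidymodels)

# Extract the hyperparameter combination that minimized RMSE
best_params <- select_best(rf_tuned, metric = "rmse")

# Lock those values into the random forest workflow
final_rf_wflow <- finalize_workflow(rf_wflow, best_params)

# Fit the finalized workflow on the entire training set
final_rf_fit <- fit(final_rf_wflow, data = nba_train)
```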

Final Model Analysis

As mentioned in the target variable exploration, I transformed salaries by the 7th root for normality purposes. While we should analyze the transformed version of the final fitted model, we should also analyze the target variable on the original scale. This analysis will provide us with further interpretation of the results.

Transformation results

The metrics for the root mean squared error (RMSE), the mean absolute error (MAE), and R-squared (RSQ) are shown in Table 26. The interpretation of RMSE and MAE is that the average difference between observed salaries and the predicted salaries of the random forest model is about 0.449 and 0.335, respectively. Of course, the nature of our transformation makes it difficult to interpret these values in dollar terms. It is not surprising that RMSE is greater than MAE, since MAE does not penalize outliers as heavily as RMSE does. So, our results may imply the presence of salary outliers. The interpretation of RSQ is that about 93% of the variability of the observed data can be explained by our random forest model, which is a great value to see.

Table 26: Performance metrics of final model (7th-root transformed)
.metric .estimator .estimate
rmse standard 0.4494820
mae standard 0.3351230
rsq standard 0.9301944
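The Table 26 metrics can be computed with yardstick on the transformed scale. A sketch, where `final_rf_fit`, `nba_test`, and the transformed target column `salary_7th` are assumed names:

```r
library(tidymodels)

# Bundle the three metrics reported in Table 26
salary_metrics <- metric_set(rmse, mae, rsq)

# Predict on the test set and keep the observed values alongside
test_preds <- predict(final_rf_fit, new_data = nba_test) |>
  bind_cols(nba_test)

# Evaluate on the 7th-root scale
salary_metrics(test_preds, truth = salary_7th, estimate = .pred)
```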

Figure 8 gives us a visualization of predicted salaries against actual salaries. While the actual dollar amounts are still unclear, this scatterplot gives a better understanding of the accuracy of our model in terms of variability. Estimate errors also seem to vary between overestimating and underestimating the actual values. Performing an analysis on the original scale should clear up the interpretations.

Figure 8

Original Results

Table 27 again shows the metrics for the root mean squared error, the mean absolute error, and R-squared, this time examining salaries on the original scale. The interpretation of RMSE and MAE is that the average difference between observed salaries and the predicted salaries of the random forest model is about 2,543,555 and 1,363,147 dollars, respectively. Generally speaking, these seem like rather large errors. It is important to consider that NBA players make millions of dollars each season, so seeing errors in the millions is not unimaginable. However, I would have liked to see these errors smaller, closer to 1 million dollars. Our results still imply the presence of salary outliers. The interpretation of RSQ is that about 93.4% of the variability of the observed data can be explained by our random forest model.

Table 27: Performance metrics of final model (original scale)
.metric .estimator .estimate
rmse standard 2.543555e+06
mae standard 1.363147e+06
rsq standard 9.344827e-01
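Undoing the 7th-root transform is a matter of raising both the predictions and the observed values to the 7th power, which recovers dollar units before recomputing the metrics. A sketch, assuming a `test_preds` data frame with columns `.pred` and `salary_7th`:

```r
library(tidymodels)

orig_metrics <- metric_set(rmse, mae, rsq)

# Invert the 7th-root transform: (x^(1/7))^7 = x
orig_preds <- test_preds |>
  mutate(
    pred_salary   = .pred^7,
    actual_salary = salary_7th^7
  )

# Metrics on the original dollar scale, as in Table 27
orig_metrics(orig_preds, truth = actual_salary, estimate = pred_salary)
```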

Figure 9 gives us a visualization of predicted salaries against actual salaries on the original scale. What may actually be reassuring is that estimate errors tend to underestimate the actual values, especially as salaries increase along the x-axis. If we know that predicted salaries tend to underestimate what players typically receive, an NBA player can simply take this prediction and raise its value by 1 or 2 million dollars. If the model predicts smaller salaries, a player should feel more assured that the model's prediction is not riddled with error.

Figure 9

Was it Worth it?

Based on our final model analysis, it seems that creating a predictive model was indeed worth the project. While the estimate errors were less than ideal for me, the high proportion of explained variability shows that the model fits our data exceptionally well. The model gives us, at the least, a solid prediction of salaries that an NBA player or agent can adjust when bargaining for contracts. The random forest model did particularly well in predicting salaries because of its ability to deal with small nonlinear variations while retaining a bit of simplicity in the process. We did not need the nonlinear complexity that MARS and neural networks bring, but we did need more complexity than a null or linear model could capture.

Conclusion

This project successfully created a predictive model that can estimate NBA seasonal salaries based on a player's season statistics. Much of the project's work was finding a good level of complexity for a model and its recipe. I found it interesting that even though there are many intricate basketball-related statistics, fairly simple metrics and interactions between them can explain a significant portion of expected salaries. With this model, I imagine the bargaining process for NBA contracts would be easier for players and agents.

Understanding hyperparameters and how to appropriately select them was certainly the greatest challenge for me. While I believe I did a solid job with my results, there is still room to re-tune many of my models. Retuning could lead to lower RMSE values, alleviating my dissatisfaction with the RMSE of my final fitted model. Due to time constraints, those adjustments cannot happen in this report. However, the work in this project is ongoing, and I can continue to tune my hyperparameters in the future.

Appendix - EDA

This section provides further information on how predictors in recipes were chosen. It is not a completely exhaustive EDA, but rather an analysis that highlights what I believe to be the most relevant observations justifying my decisions. This analysis is performed on the entire training set.

Interactions

Interaction terms between variables are used throughout my recipes and are a highlight of the interaction recipe. Here, we examine a few more interactions.
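In a tidymodels recipe, interactions like the ones explored below enter through `step_interact()`. This is a minimal sketch, not the project's exact recipe; the variable names (`all_star`, `market_size`, `ten_years`, `games_played`) and the training data object `nba_train` are assumptions:

```r
library(recipes)

interaction_recipe <- recipe(salary_7th ~ ., data = nba_train) |>
  # Interactions require numeric inputs, so dummy-encode factors first
  step_dummy(all_nominal_predictors()) |>
  # All-star status crossed with market size dummies
  step_interact(~ starts_with("all_star"):starts_with("market_size")) |>
  # Longevity crossed with games played
  step_interact(~ starts_with("ten_years"):games_played) |>
  step_normalize(all_numeric_predictors())
```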

Figure 10 shows the box plot distribution of salaries against whether or not a player is an all-star. While the salary values are transformed to the 7th root, it is clear that being an all-star is correlated with higher salaries.

Figure 10

It may be the case that, within being an all-star, those who play for larger market teams also have higher salaries than all-stars who play for small-market teams. My definition of market size comes from the valuation of NBA teams. The number of teams in each market size is equal (10 teams each). In Figure 11, we find that increasing the market size group within those who are all-stars does increase salaries. We do not see this observation within those who are not all-stars. This analysis implies that NBA teams in large markets are more likely to pay higher salaries to all-star-caliber players.

Figure 11

We can do a similar market-size analysis using players’ longevity. Figure 12 shows the box plot distribution of salaries against whether or not a player has been in the NBA for ten or more years. We see that longevity is positively correlated with seasonal salaries.

Figure 12

Figure 13 incorporates market size into the analysis. The results are near-identical to Figure 11: increasing the market size group within those who have been in the NBA for at least 10 years does increase salaries. In this scenario, we could argue that a large market tends to demand well-known players for entertainment purposes. Players who have been in the NBA for a decade are more likely to be known by all fans. So, they get paid higher amounts to play for these teams.

Figure 13

Simple reasoning can help us conclude that as a player appears in more games during the season, our model should predict a higher salary. Again, the longevity of a player may play a role in this salary as well. Figure 14 shows the scatterplot between games played and salaries, with curves depicting the trends of those who have and have not been in the NBA for at least five years. We can see that players with more longevity are paid more at every level of the x-axis. However, this observation could just be a result of many young players being on rookie contracts. So even though they are effective as players, rookies have not had the opportunity to adjust their contracts accordingly.

Figure 14

Figure 15 addresses some of this concern by showing a scatterplot between games started and salaries, with curves depicting the trends of those who have and have not been in the NBA for at least ten years. With the ten-year mark, players have had time to get new contracts that reflect their skill levels. Furthermore, it is fair to assume that players who start games are highly valued by their team, implying they should be paid more. Still, Figure 15 identifies higher salaries for players with more longevity, although the difference between the two groups does seem smaller than in Figure 14.

Figure 15

Unusual distributions

In the missingness check, I note that observations with percentage-based predictors that are NA or 0 should be dropped from the dataset. What I do not mention are the cases of observations that have nearly 100% for these predictors. Figure 16 shows the density distribution of field goal percentages. Generally speaking, this graph shows a fairly normal distribution, with the middle fifty percent of players falling between .405 and .485. On the left side of the distribution, we can see the cluster of players with a field goal percentage of 0, which we expected. However, there is a cluster on the right-hand side of the distribution with a near-impossible value of 1.

Figure 16

Table 28 lets us evaluate the players with this value further. We see that, like those with 0 or NA percentages, these players made very minimal contributions to their teams during the season. All players in this group played only a handful of games and averaged a relatively small number of minutes in those games. We can attribute these observations to players who typically only play during “garbage time,” which is when a game is a blowout win or loss and coaches replace usual role players with reserves to give the typical rotation a rest. Ultimately, these players are outliers within our dataset that do not contribute a great deal towards explaining how season statistics predict salaries. Therefore, observations with unusually high field goal percentages (which I measure as 95% and up) should be dropped from the study.

Table 28: Observations with field goal values of 100%

The same trends persist for all percentage-based predictors, such as the 3-point percentage in Figure 17 and the effective field goal percentage in Figure 18. So, all unusual percentages are dropped accordingly.

Figure 17
Figure 18
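The dropping rule described above can be sketched as a single dplyr filter. The column names (`fg_pct`, `fg3_pct`, `efg_pct`) and the data frame `nba_data` are assumptions standing in for the project's objects:

```r
library(dplyr)

# Keep only observations whose percentage predictors are present,
# above 0, and below the 95% cutoff discussed above
nba_clean <- nba_data |>
  filter(
    if_all(c(fg_pct, fg3_pct, efg_pct),
           ~ !is.na(.x) & .x > 0 & .x < 0.95)
  )
```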

Appendix - Data Wrangling

This section briefly identifies the resources and explains the process of collecting the data used in this project.

Data Sources

When searching for a data source, I wanted data that recorded season statistics and salaries of players over an extended period. Unfortunately, I could not find one data set that had this information. So, I decided to do a bit of data wrangling and merging of data sets.

Salaries

I found a few data sets with yearly NBA salaries for each player on Kaggle. The first data set, by “patrick”, provides salary data from 1996 to 2019. The second data set, by “Fernando Blanco”, provides salary data from 1990 to 2017. I combined these two data sets to get salaries from 1990 to 2019. In addition, I web-scraped 2020, 2021, and 2022 season salaries from ESPN.

After merging, additional cleaning of players’ names had to be done. I had to distinguish between players who share the same name (e.g., there have been multiple players named Charles Smith). Since the salary data and the season stats data do not have any matching ID variables, I had to rely on a composite key of the player’s name and the season for merging, so distinguishing individual players was critical. Additional details of these processes are found in 0_salary_wrangling.R.
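The composite-key merge can be sketched as follows. The data frame and column names (`salaries`, `season_stats`, `player_name`, `season`) are assumptions based on the description, and the duplicate check shows why distinguishing same-named players matters before joining:

```r
library(dplyr)

# Any duplicated name-season pairs would turn the join into a
# many-to-many match, so they must be resolved first
salaries |>
  count(player_name, season) |>
  filter(n > 1)

# Merge on the composite key of player name and season
nba_data <- season_stats |>
  inner_join(salaries, by = c("player_name", "season"))
```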

Season Statistics

Season statistics were collected through Stathead Basketball’s querying system. Collecting these baseline statistics was not difficult. However, this data set contains almost entirely numerical predictors. Thus, the majority of the work done in 0_nba_stats_wrangling.R was adding categorical predictors to the dataset. Some of these predictors include made_playoffs, which indicates whether or not a player played in the playoffs that year, and five_years, which indicates that a player has been in the NBA for five or more years.
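Deriving such categorical predictors is a simple mutate. This sketch assumes hypothetical source columns (`playoff_games`, `rookie_season`) that are not named in the report:

```r
library(dplyr)

nba_stats <- nba_stats |>
  mutate(
    # Played at least one playoff game that year
    made_playoffs = factor(if_else(playoff_games > 0, "yes", "no")),
    # In the NBA for five or more years
    five_years    = factor(if_else(season - rookie_season >= 5, "yes", "no"))
  )
```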